Visualisation with ggplot
Etienne Côme
October 29, 2024
Visualisation ?
“Transformation of the symbolic into the geometric” [McCormick et
al. 1987]
“… finding the artificial memory that best supports our natural means
of perception.” [Bertin 1967]
“The use of computer-generated, interactive, visual representations
of data to amplify cognition.” [Card, Mackinlay, & Shneiderman
1999]
Why visualize?
Integrating the human in the loop
Answer questions or find questions?
Making decisions
Putting data in context
Amplify memory
Graphical computation
Find structures and patterns
Presenting arguments
Why visualize?
Analyze :
Developing and criticizing hypotheses
Discovering errors
Find patterns
Communicate
Sharing and convincing
Collaborate and review
Anscombe quartet
Summary statistics of the four datasets:
set   mean(x)   mean(y)    sd(x)      sd(y)
1     9         7.500909   3.316625   2.031568
2     9         7.500909   3.316625   2.031657
3     9         7.500000   3.316625   2.030424
4     9         7.500909   3.316625   2.030578
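The summary values above can be recomputed from the `anscombe` data frame that ships with base R (columns `x1` to `x4` and `y1` to `y4`). A minimal sketch, assuming the four values listed per dataset are the mean and standard deviation of x and y; the name `anscombe_stats` is introduced here for illustration:

```r
# Mean and standard deviation of x and y for each of the four Anscombe datasets
anscombe_stats <- do.call(rbind, lapply(1:4, function(i) {
  x <- datasets::anscombe[[paste0("x", i)]]
  y <- datasets::anscombe[[paste0("y", i)]]
  data.frame(set = i,
             mean_x = mean(x), mean_y = mean(y),
             sd_x = sd(x), sd_y = sd(y))
}))
print(anscombe_stats)
```

The four rows are nearly identical, which is the point of the quartet: only a plot reveals how different the datasets are.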
Anscombe quartet
Cholera map (John Snow)
Visualization = encode the data using visual channels
Visual channels
Bertin Jacques, Sémiologie graphique, Paris, Mouton/Gauthier-Villars,
1967.
Marks // visual channels
Marks: graphical building blocks
Visual channels: the visual properties that vary
Marks, visual channels
Not all channels are equal
Marks, visual channels
The best channels depend on the feature type
(continuous, categorical, ordinal,…)
Marks, visual channels
The interesting part is not already available
pre-attentive processing
How many 3 ?
1281768756138976546984506985604982826762
9809858458224509856458945098450980943585
9091030209905959595772564675050678904567
8845789809821677654876364908560912949686
pre-attentive processing
How many 3 ?
128176875613 8976546984506985604982826762
9809858458224509856458945098450980943 585
909103 0209905959595772564675050678904567
88457898098216776548763 64908560912949686
pre-attentive processing
library(ggplot2)
ggplot(mpg) + geom_point(aes(x = cty, y = hwy, color = class))
Questions? Feature types?
continuous? discrete? ordinal? temporal? spatial?
Some categories and one quantity for each category
The bar chart
library(rjson)
library(dplyr)
?mpg
The mpg dataset
The bar chart
# mean city mileage per manufacturer
m_cty = mpg %>% group_by(manufacturer) %>% summarize(mcty = mean(cty))
ggplot(data = m_cty) +
  geom_bar(aes(x = manufacturer, y = mcty), stat = "identity") +
  scale_x_discrete("Manufacturer") +
  scale_y_continuous("Miles / Gallon (City conditions)")
Order?
# sort by decreasing mean and freeze that order in the factor levels
m_cty_ordered = m_cty %>% arrange(desc(mcty)) %>%
  mutate(manufacturer = factor(manufacturer, levels = manufacturer))
ggplot(data = m_cty_ordered) +
  geom_bar(aes(x = manufacturer, y = mcty), stat = "identity") +
  scale_x_discrete("Manufacturer") +
  scale_y_continuous("Miles / Gallon (City conditions)")
Horizontal?
ggplot(data = m_cty_ordered) +
  geom_bar(aes(x = manufacturer, y = mcty), stat = "identity") +
  scale_x_discrete("Manufacturer") +
  scale_y_continuous("Miles / Gallon (City conditions)") +
  coord_flip()
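ggplot draws the values of a discrete axis in factor-level order (alphabetical for plain character vectors), which is why the ordered version rebuilds `manufacturer` as a factor. A minimal base-R sketch of that step on a small hypothetical data frame; `toy` and its three values are illustrative and not taken from `mpg`:

```r
# Hypothetical summary table (illustrative values only)
toy <- data.frame(manufacturer = c("audi", "honda", "jeep"),
                  mcty = c(17.6, 24.4, 13.5),
                  stringsAsFactors = FALSE)

# Sort by decreasing value, then freeze that order in the factor levels
toy_ordered <- toy[order(-toy$mcty), ]
toy_ordered$manufacturer <- factor(toy_ordered$manufacturer,
                                   levels = toy_ordered$manufacturer)
levels(toy_ordered$manufacturer)

# Optional plot, built only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(toy_ordered) +
    ggplot2::geom_col(ggplot2::aes(x = manufacturer, y = mcty))
}
```

Here `geom_col()` is used as a shorthand for `geom_bar(stat = "identity")`.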
The line chart:
1 numeric variable with respect to time
Vélib’ data:
url = "./data/sp_Lyon.json"
library(rjson)  # provides fromJSON()
library(dplyr)
# read some data
data = fromJSON(file = url)
# convert each record to a one-row data.frame
extract = function(x) {
  data.frame(id = x[["_id"]],
             time = x$download_date,
             nbbikes = x$available_bikes)
}
st_tempstats.df = do.call(rbind, lapply(data, extract))
# total number of available bikes at each time step
tempstats.df = st_tempstats.df |> group_by(time) |> summarise(nbbikes = sum(nbbikes))
Time, natural order
ggplot(data = tempstats.df, aes(x = time, y = nbbikes)) + geom_point()
Time, natural order
ggplot(data = tempstats.df, aes(x = time, y = nbbikes)) + geom_line()
Aspect ratio
ggplot(data = tempstats.df, aes(x = time, y = nbbikes)) + geom_line()
Aspect ratio, 45°
Heuristic: use the aspect ratio that results in an average line slope of
45°.
Cleveland, William S., Marylyn E. McGill, and Robert McGill. “The shape
parameter of a two-variable graph.” Journal of the American Statistical
Association 83.402 (1988): 289-300.
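One simple way to apply this heuristic is to pick the ratio that brings the median absolute slope of the line segments to 1. This is a sketch of a simplified variant, not necessarily the exact procedure of the cited paper; `bank_to_45` is a hypothetical helper, the series is made up, and `x` must be numeric (a time column would need converting first):

```r
# Ratio (length of one y unit over length of one x unit, the convention of
# coord_fixed) for which the median absolute segment slope is drawn at 45 degrees
bank_to_45 <- function(x, y) {
  slopes <- diff(y) / diff(x)
  slopes <- slopes[is.finite(slopes) & slopes != 0]
  1 / median(abs(slopes))
}

# Hypothetical series: every segment has slope +2 or -2 in data units
x <- 1:6
y <- c(0, 2, 0, 2, 0, 2)
ratio <- bank_to_45(x, y)
ratio

# Optional plot, built only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(data.frame(x, y), ggplot2::aes(x, y)) +
    ggplot2::geom_line() +
    ggplot2::coord_fixed(ratio = ratio)
}
```

With a data slope of 2, a ratio of 0.5 halves the drawn height of each y unit, so every segment appears at 45°.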
Area + Scale
ggplot(data = tempstats.df, aes(x = time, y = nbbikes)) + geom_area()
Point of view
ggplot(data = tempstats.df, aes(x = time, y = max(nbbikes) - nbbikes)) +
  geom_area()
1 numeric variable with respect to time + categories
Vélib’ data per station
# read data and pre-processing
url = "./data/sp_Lyon.json"
data = fromJSON(file = url)
extract = function(x) {
  data.frame(id = x[["_id"]],
             time = x$download_date,
             nbbikes = x$available_bikes)
}
st_tempstats.df = do.call(rbind, lapply(data, extract))
# draw 8 station ids at random
sel = st_tempstats.df %>% select(id) %>% unique() %>% sample_n(8) %>% pull()
# keep only the selected stations
st_tempstats_sub.df = st_tempstats.df %>%
  filter(id %in% sel)
Multiple line charts
# linewidth replaces size for lines in ggplot2 >= 3.4
ggplot(data = st_tempstats_sub.df) +
  geom_line(aes(x = time, y = nbbikes, group = id, color = factor(id)), linewidth = 2)
Small multiples
ggplot(data = st_tempstats_sub.df) +
  geom_line(aes(x = time, y = nbbikes, group = id, color = factor(id)), linewidth = 2) +
  facet_grid(id ~ .)
2 numeric features + categories
Scatter plot + colors
mpg_su = mpg %>%
  filter(class %in% c("compact", "suv", "pickup", "minivan"))
ggplot(mpg_su) + geom_point(aes(x = cty, y = hwy, color = class))
Scatter plot + symbols
mpg_su = mpg %>%
  filter(class %in% c("compact", "suv", "pickup", "minivan"))
ggplot(mpg_su) + geom_point(aes(x = cty, y = hwy, shape = class))
3 numeric features (with one > 0) + categories
Scatter plot + color + size
ggplot(mpg_su) + geom_point(aes(x = cty, y = hwy, color = class, size = displ))
Scatter plot + color + size: mind the scales!
ggplot(mpg_su) + geom_point(aes(x = cty, y = hwy, color = class, size = displ))
Circle size: radius or area?
Radius
Area
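The difference between the two encodings can be checked with a little arithmetic. A sketch on two made-up values, the second twice the first; ggplot2 offers `scale_size_area()` (area proportional to the value) and `scale_radius()` (radius mapped linearly) for this choice, and the plot at the end is only built if ggplot2 is installed:

```r
# Two hypothetical data values, the second twice the first
values <- c(1, 2)

# Value mapped to the radius: the drawn area grows with the square of the value
area_if_radius <- pi * values^2

# Value mapped to the area: the radius grows with the square root of the value
radius_if_area <- sqrt(values / pi)
area_if_area <- pi * radius_if_area^2

ratio_radius_mapping <- area_if_radius[2] / area_if_radius[1]
ratio_area_mapping <- area_if_area[2] / area_if_area[1]
c(radius = ratio_radius_mapping, area = ratio_area_mapping)

# Optional plot with area proportional to the value
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(ggplot2::mpg) +
    ggplot2::geom_point(ggplot2::aes(x = cty, y = hwy, size = displ)) +
    ggplot2::scale_size_area()
}
```

With the radius mapping, a value that doubles is drawn with four times the ink; with the area mapping, the drawn area doubles as well.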
Principle:
\[\textrm{Lie factor} = \frac{\textrm{visual effect size}}{\textrm{data effect size}}\]
Lie factor:
\[\textrm{data effect size} = \frac{27.5 - 18}{18} \times 100 = 53 \%\]
Edward Tufte, The Visual Display of Quantitative Information, Cheshire,
CT, Graphics Press, 2001, 2nd ed. (1st ed. 1983)
Lie factor:
\[\textrm{visual effect size} = \frac{5.3 - 0.6}{0.6} \times 100 = 783 \%\]
Lie factor:
\[\textrm{Lie factor} = \frac{783}{53} = 14.8\]
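The lie factor arithmetic can be reproduced in a few lines of R. A minimal sketch using the numbers quoted on these slides (data going from 18 to 27.5, drawn sizes going from 0.6 to 5.3); `effect_size` and `lie_factor` are names introduced here for illustration:

```r
# Relative change between a first and a second value
effect_size <- function(first, second) (second - first) / first

# Lie factor = visual effect size / data effect size
lie_factor <- function(data_first, data_second, visual_first, visual_second) {
  effect_size(visual_first, visual_second) / effect_size(data_first, data_second)
}

data_effect <- effect_size(18, 27.5)
visual_effect <- effect_size(0.6, 5.3)
lf <- lie_factor(18, 27.5, 0.6, 5.3)

round(100 * data_effect)    # data effect size, in percent
round(100 * visual_effect)  # visual effect size, in percent
round(lf, 1)                # lie factor
```

A lie factor of 1 means the drawing is faithful to the data; values well above 1 mean the graphic exaggerates the change.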
Lie factor: 9.4
Knowing that the “apple” area (in green) is equal to \(2.22\,cm^2\) and that the rim area (in
blue) is equal to \(2.96\,cm^2\), compute the lie factor.
Perception
\[S = I^p\]
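This relation is commonly known as Stevens' power law: the perceived magnitude S grows as a power p of the physical intensity I. A small sketch of what it implies when a quantity doubles; the exponents used (about 1 for length, about 0.7 for area) are commonly quoted values taken here as assumptions, and `perceived_ratio` is a hypothetical helper:

```r
# Perceived ratio between two stimuli whose physical intensities differ by
# `intensity_ratio`, under S = I^p
perceived_ratio <- function(intensity_ratio, p) intensity_ratio^p

length_doubling <- perceived_ratio(2, p = 1.0)  # assumed exponent for length
area_doubling <- perceived_ratio(2, p = 0.7)    # assumed exponent for area
c(length = length_doubling, area = area_doubling)
```

Under these assumptions a doubled length is perceived as doubled, while a doubled area is perceived as noticeably less than doubled, which is one reason position and length are preferred over area for quantities.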
Principle:
Increase the data density
\[\textrm{graph data density} = \frac{\textrm{number of entries in data matrix}}{\textrm{area of data display}}\]
Data density:
Avoid graphics with low data density
Edward Tufte, The Visual Display of Quantitative Information, Cheshire,
CT, Graphics Press, 2001, 2nd ed. (1st ed. 1983)
Principle:
Increase the data-ink ratio
\[\textrm{data-ink ratio} = \frac{\textrm{area of data-ink}}{\textrm{total area of ink}}\]
Data-ink ratio :
Recap
Avoid misleading graphics!
Avoid empty graphics
Be parsimonious with ink
Mind the scales (colors, size)
Use explicit labels and
Mind categorical features and their order
Aspect ratio
File type: pdf, svg (vector) vs. png, jpg (raster)
ggplot
gg = grammar of graphics
“The Grammar of Graphics” (Wilkinson, 2005)
grammar → same language for all figures
ggplot
building blocks of the grammar
the coordinate system
data and aesthetic mappings, e.g. f(data) → x position, y position, size, shape, color
geometric objects, e.g. points, lines, bars, texts
scales, e.g. f([0, 100]) → [0, 5] px
facet specification, e.g. split the data into several plots
statistical transformations, e.g. average, counting, regression
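The scale example f([0, 100]) → [0, 5] px is simply a linear map from a data interval to a visual interval. A minimal base-R sketch of such a map; `linear_scale` and `to_px` are hypothetical names, and this shows the idea only, not how ggplot2 implements its scales:

```r
# Build a function that maps the interval `from` linearly onto the interval `to`
linear_scale <- function(from, to) {
  function(v) {
    to[1] + (v - from[1]) / (from[2] - from[1]) * (to[2] - to[1])
  }
}

# Data values in [0, 100] mapped to sizes in [0, 5] px
to_px <- linear_scale(from = c(0, 100), to = c(0, 5))
to_px(c(0, 50, 100))
```

In ggplot2 the `scale_*` functions play this role for each aesthetic, for instance controlling limits, breaks, and the output range.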
ggplot
Make a graphic:
add several layers
with their own visual encoding and possibly their own data
(+ optional) add statistical transformation
(+ optional) change scale options
(+ optional) specify title, theme, guides, style …
Warning: data = tidy data.frame with the right feature types
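The feature type decides which scale ggplot picks: a numeric column gets a continuous scale, a factor gets a discrete one. A short sketch with the base-R `mtcars` data, where `cyl` is stored as a number although it behaves like a category; the copy `cars` is a name introduced here:

```r
cars <- datasets::mtcars
class(cars$cyl)   # numeric: would be mapped to a continuous color gradient

# Declare cyl as a categorical feature before plotting
cars$cyl <- factor(cars$cyl)
levels(cars$cyl)

# Optional plot, built only if ggplot2 is installed: one distinct color per level
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(cars) +
    ggplot2::geom_point(ggplot2::aes(x = wt, y = mpg, color = cyl))
}
```

The same conversion can be done inline, as in `shape = factor(cyl)`, but fixing the type once in the data keeps every layer consistent.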
ggplot, geometries
Make a graphic:
add several layers
+geom_line()
with their own visual encoding and possibly their own data
aes(x=a,y=b,...)
Example
ggplot(mpg) +
  geom_point(aes(x = cty, y = hwy, color = manufacturer, shape = factor(cyl)))
ggplot(mpg, aes(x = cty, y = hwy, color = manufacturer, shape = factor(cyl))) +
  geom_jitter()
ggplot
ggplot(mpg, aes(x = cty, y = hwy, color = class)) + geom_point()
ggplot
ggplot(mpg, aes(x = cty, y = hwy, color = class)) + geom_jitter()
ggplot
ggplot(mpg, aes(x = cty, fill = class)) + geom_histogram(binwidth = 2)
ggplot
ggplot(mpg, aes(y = cty, x = class)) + geom_violin()
ggplot, scales
Make a graphic :
add several layers
+geom_line()
with their own visual encoding and possibly their own data
aes(x=a,y=b,...)
(+ optional) change scale options
scale_fill_brewer(palette=3,type="qual")
scale_x_continuous(limits=c(0,45),breaks=seq(0,45,2))
ggplot, scales
ggplot(mpg, aes(x = cty, y = hwy, color = manufacturer, shape = factor(cyl))) +
  geom_jitter() +
  scale_x_continuous(limits = c(0, 45), breaks = seq(0, 45, 2))
ggplot, faceting
Make a graphic :
add several layers
+geom_line()
with their own visual encoding and possibly their own data
aes(x=a,y=b,...)
(+ optional) change scale options
scale_fill_brewer(palette=3,type="qual")
scale_x_continuous(limits=c(0,45),breaks=seq(0,45,2))
use facets?
facet_grid(. ~ cyl)
ggplot, faceting
ggplot(data = mpg, aes(x = hwy, y = cty, color = class)) +
  geom_point() +
  facet_wrap(~ year)
ggplot, stats
Make a graphic :
add several layers
+geom_line()
with their own visual encoding and possibly their own data
aes(x=a,y=b,...)
(+ optional) change scale options
scale_fill_brewer(palette=3,type="qual")
scale_x_continuous(limits=c(0,45),breaks=seq(0,45,2))
add statistics
stat_density2d()
ggplot
ggplot(mpg, aes(y = cty, x = hwy)) +
  geom_point(color = "blue") + stat_density2d()
ggplot
ggplot(mpg, aes(y = cty, x = hwy)) +
  geom_point(color = "blue") + stat_smooth()
ggplot
library(hexbin)
ggplot(mpg, aes(y = cty, x = hwy)) +
  stat_binhex()
Exercises
Update the scale and labels
# download and reshape the data
url = "./data/sp_Lyon.json"
data = fromJSON(file = url)
extract = function(x) {
  data.frame(id = x[["_id"]],
             time = x$download_date,
             nbbikes = x$available_bikes)
}
st_tempstats.df = do.call(rbind, lapply(data, extract))
# select 3 stations
st_tempstats_sub.df = st_tempstats.df %>%
  filter(id %in% sel)
# linewidth replaces size for lines in ggplot2 >= 3.4
ggplot(data = st_tempstats_sub.df) +
  geom_line(aes(x = time, y = nbbikes, group = id, color = factor(id)), linewidth = 2) +
  facet_grid(id ~ .)
Exercises
Reproduce this graphic (Iris data)
Exercises
Reproduce this graphic (mtcars data). Hint: modify the theme of the graphic.
?theme
Exercises
Reproduce this graphic
Exercises
Reproduce this graphic. Information:
Bike sharing data from Lyon (data folder)
Compute the occupancy rate: nb bikes / max(nb bikes)
Pivot to wide format
Run a k-means with 8 clusters on X (rows = stations, columns = time slots)
Facet + mean curve + alpha blending